! huggingface-cli login
Twitter Bot: NLP Emotion Classifier
Python
Deep Learning
NLP
Building and deploying an emotion-classifying Twitter bot that responds to users who prompt it with a hashtag of interest. The bot uses a pretrained DistilBERT encoder (distilbert-base-uncased, a distilled version of BERT) fine-tuned on a tweet emotion dataset.
! pip install datasets
from datasets import list_datasets
import tensorflow as tf
from transformers import pipeline, PushToHubCallback
all_datasets = list_datasets()
print(all_datasets[0:5])
['acronym_identification', 'ade_corpus_v2', 'adversarial_qa', 'aeslc', 'afrikaans_ner_corpus']
from datasets import load_dataset
emotions = load_dataset('emotion')
Downloading and preparing dataset emotion/default (download: 1.97 MiB, generated: 2.07 MiB, post-processed: Unknown size, total: 4.05 MiB) to /root/.cache/huggingface/datasets/emotion/default/0.0.0/348f63ca8e27b3713b6c04d723efe6d824a56fb3d1449794716c0f0296072705...
Dataset emotion downloaded and prepared to /root/.cache/huggingface/datasets/emotion/default/0.0.0/348f63ca8e27b3713b6c04d723efe6d824a56fb3d1449794716c0f0296072705. Subsequent calls will reuse this data.
train_ds = emotions['train']
train_ds
Dataset({
features: ['text', 'label'],
num_rows: 16000
})
train_ds[0]
{'text': 'i didnt feel humiliated', 'label': 0}
import pandas as pd
emotions.set_format(type='pandas')
df = emotions['train'][:]
df
| | text | label |
|---|---|---|
| 0 | i didnt feel humiliated | 0 |
| 1 | i can go from feeling so hopeless to so damned... | 0 |
| 2 | im grabbing a minute to post i feel greedy wrong | 3 |
| 3 | i am ever feeling nostalgic about the fireplac... | 2 |
| 4 | i am feeling grouchy | 3 |
| ... | ... | ... |
| 15995 | i just had a very brief time in the beanbag an... | 0 |
| 15996 | i am now turning and i feel pathetic that i am... | 0 |
| 15997 | i feel strong and good overall | 1 |
| 15998 | i feel like this was such a rude comment and i... | 3 |
| 15999 | i know a lot but i feel so stupid because i ca... | 0 |
16000 rows × 2 columns
def label_int2str(row):
    # Map an integer class id to its string name using the dataset's ClassLabel feature
    return emotions['train'].features['label'].int2str(row)

df['label_name'] = df['label'].apply(label_int2str)
df.head()
| | text | label | label_name |
|---|---|---|---|
| 0 | i didnt feel humiliated | 0 | sadness |
| 1 | i can go from feeling so hopeless to so damned... | 0 | sadness |
| 2 | im grabbing a minute to post i feel greedy wrong | 3 | anger |
| 3 | i am ever feeling nostalgic about the fireplac... | 2 | love |
| 4 | i am feeling grouchy | 3 | anger |
import matplotlib.pyplot as plt
df['label_name'].value_counts(ascending=True).plot.barh()
plt.title("Frequency of Classes")
plt.show()
df['words_per_tweet'] = df['text'].str.split().apply(len)
df.boxplot('words_per_tweet', by='label_name', grid=False, showfliers=False)
plt.suptitle("")
plt.xlabel("")
plt.show()
emotions.reset_format()
! pip install transformers
from transformers import AutoTokenizer
model_checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
def tokenize(batch):
    # Pad to the longest sequence in the batch and truncate to the model's maximum length
    return tokenizer(batch['text'], padding=True, truncation=True)
print(tokenize(emotions['train'][0:2]))
{'input_ids': [[101, 1045, 2134, 2102, 2514, 26608, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1045, 2064, 2175, 2013, 3110, 2061, 20625, 2000, 2061, 9636, 17772, 2074, 2013, 2108, 2105, 2619, 2040, 14977, 1998, 2003, 8300, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
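The trailing zeros in the first input_ids list, and the matching zeros in attention_mask, come from padding=True, which pads every sequence in the batch to the length of the longest one. The sketch below illustrates that idea in plain Python. It is not the tokenizer's actual implementation: pad_batch is a hypothetical helper, the pad id of 0 is taken from the output above, and the second sequence is a toy example.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens keep their ids; padding positions get pad_id
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# First sequence: the ids of "i didnt feel humiliated" from the output above.
# Second sequence: a shortened toy example (assumption, not from the dataset).
batch = pad_batch([[101, 1045, 2134, 2102, 2514, 26608, 102],
                   [101, 1045, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The attention mask is what lets the model treat padded positions as absent, so batching sequences of different lengths should not change the prediction for any single tweet.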
emotions_encoded = emotions.map(tokenize, batched=True, batch_size=None)
from transformers import TFAutoModelForSequenceClassification
num_labels = 6

tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
tf_model
Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_layer_norm', 'vocab_transform', 'activation_13', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['dropout_19', 'pre_classifier', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
<transformers.models.distilbert.modeling_tf_distilbert.TFDistilBertForSequenceClassification at 0x7f50d851c510>
from sklearn.metrics import accuracy_score, f1_score
tokenizer_columns = tokenizer.model_input_names
batch_size = 64

tf_train_dataset = emotions_encoded['train'].to_tf_dataset(columns=tokenizer_columns,
                                                           label_cols=['label'],
                                                           shuffle=True, batch_size=batch_size)

tf_validation_dataset = emotions_encoded['validation'].to_tf_dataset(columns=tokenizer_columns,
                                                                     label_cols=['label'],
                                                                     shuffle=True, batch_size=batch_size)
callbacks = [PushToHubCallback("model_output/",
                               tokenizer=tokenizer,
                               hub_model_id="twitter-emotion-classifier-BERT")]
tf_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
                 loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                 metrics=tf.metrics.SparseCategoricalAccuracy())

tf_model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=2, callbacks=callbacks)
Cloning https://huggingface.co/jakegehri/twitter-emotion-classifier-BERT into local empty directory.
Epoch 1/2
6/250 [..............................] - ETA: 2:05 - loss: 0.1446 - sparse_categorical_accuracy: 0.9245
WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.1977s vs `on_train_batch_end` time: 0.3150s). Check your callbacks.
250/250 [==============================] - 163s 624ms/step - loss: 0.1101 - sparse_categorical_accuracy: 0.9490 - val_loss: 0.1436 - val_sparse_categorical_accuracy: 0.9345
Epoch 2/2
250/250 [==============================] - 136s 545ms/step - loss: 0.0868 - sparse_categorical_accuracy: 0.9599 - val_loss: 0.1442 - val_sparse_categorical_accuracy: 0.9325
Several commits (2) will be pushed upstream.
The progress bars may be unreliable.
remote: Scanning LFS files for validity, may be slow...
remote: LFS file scan complete.
To https://huggingface.co/jakegehri/twitter-emotion-classifier-BERT
a929610..8b9eebc main -> main
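The loss values in the training log come from SparseCategoricalCrossentropy(from_logits=True): the classifier head outputs raw scores (logits), and the loss applies a softmax before taking the negative log-probability of the true class. The following is a rough plain-Python sketch of that calculation for a single example. The logits are made-up numbers for illustration, not output from this model, and Keras averages this quantity over the batch.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(logits, label):
    """Negative log-probability assigned to the integer class `label`."""
    return -math.log(softmax(logits)[label])


# Hypothetical logits for one tweet over the six emotion classes
logits = [0.2, -1.0, -2.5, 3.1, 2.4, -0.7]
probs = softmax(logits)
loss = sparse_categorical_crossentropy(logits, label=3)
print([round(p, 3) for p in probs])
print(round(loss, 4))
```

"Sparse" refers to the labels being plain integers (0 to 5) rather than one-hot vectors, which matches the label column of the emotion dataset.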
<keras.callbacks.History at 0x7f4f960761d0>
tf_model.push_to_hub("twitter-emotion-classifier-BERT")
classifier = pipeline("text-classification", model="jakegehri/twitter-emotion-classifier-BERT")
Some layers from the model checkpoint at jakegehri/twitter-emotion-classifier-BERT were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at jakegehri/twitter-emotion-classifier-BERT and are newly initialized: ['dropout_98']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
test_tweet = "what is going on"
preds = classifier(test_tweet, top_k=6)
labels = emotions['train'].features['label'].names
emotion_int = int(preds[0]['label'].replace("_"," ").split()[1])
labels[emotion_int]
'anger'
preds
[{'label': 'LABEL_3', 'score': 0.6134325861930847},
{'label': 'LABEL_4', 'score': 0.3628736138343811},
{'label': 'LABEL_1', 'score': 0.01299766730517149},
{'label': 'LABEL_0', 'score': 0.008490157313644886},
{'label': 'LABEL_5', 'score': 0.0016536037437617779},
{'label': 'LABEL_2', 'score': 0.0005523563013412058}]
rank = []

for i in preds:
    label = i['label']
    rank.append(int(i['label'].replace("_"," ").split()[1]))

re_rank = []
for i in rank:
    re_rank.append(labels[i])
re_rank
['anger', 'fear', 'joy', 'sadness', 'surprise', 'love']
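The label parsing used above can be condensed into one small helper. This is a sketch rather than the notebook's own code: label_index and rank_emotions are hypothetical names, the label list is copied from the label_name column and the ranked output shown earlier, and preds reuses the scores printed above.

```python
# Class names in index order, as implied by the dataset's label feature
LABELS = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']


def label_index(label):
    """Turn a pipeline label such as 'LABEL_3' into the integer 3."""
    return int(label.rsplit("_", 1)[1])


def rank_emotions(preds, labels=LABELS):
    """Pair each prediction with its emotion name, keeping the pipeline's order."""
    return [(labels[label_index(p['label'])], p['score']) for p in preds]


# Predictions copied from the pipeline output above
preds = [{'label': 'LABEL_3', 'score': 0.6134325861930847},
         {'label': 'LABEL_4', 'score': 0.3628736138343811},
         {'label': 'LABEL_1', 'score': 0.01299766730517149},
         {'label': 'LABEL_0', 'score': 0.008490157313644886},
         {'label': 'LABEL_5', 'score': 0.0016536037437617779},
         {'label': 'LABEL_2', 'score': 0.0005523563013412058}]

ranked = rank_emotions(preds)
print([name for name, _ in ranked])
# → ['anger', 'fear', 'joy', 'sadness', 'surprise', 'love']
```

Keeping the name and score together in one tuple avoids maintaining two parallel lists, which may be convenient when the bot formats its reply.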
preds_df = pd.DataFrame(preds)
plt.bar(re_rank, 100 * preds_df['score'], color='C0')
plt.title(f'"{test_tweet}"')
plt.show()